Abstract. Keyword search in relational databases has been widely studied in
recent years because it requires users neither to master a structured query
language nor to know the complex underlying database schemas. Most existing
methods focus on answering snapshot keyword queries in static databases. In
practice, however, databases are updated frequently, and users may have
long-term interests in specific topics. To deal with such a situation, it is
necessary to build an effective and efficient facility in a database system to
support continual keyword queries.

In this paper, we propose an efficient method for answering continual top-k
keyword queries over relational databases. The proposed method is built on an
existing scheme of keyword search on relational data streams, but incorporates
ranking mechanisms into the query processing methods and makes two improvements
to support efficient top-k keyword search in relational databases. Compared to
the existing methods, our method is more efficient both in computing the top-k
results in a static database and in maintaining the top-k results when the
database is continually being updated. Experimental results validate the
effectiveness and efficiency of the proposed method.

Key words: Relational databases, keyword search, continual queries, results
maintenance.

1 Introduction 

With the proliferation of text data available in relational databases, simple
ways of exploring such information effectively are of increasing importance.
Keyword search in relational databases, with which a user specifies his/her
information need by a set of keywords, is a popular information retrieval method
because the user needs to know neither a complex query language nor the
underlying database schemas. It has attracted substantial research effort in
recent years, and a number of methods have been developed [1,2,3,4,5,6,7,8,9,10].

Example 1. Consider the sample publication database shown in Fig. 1. Fig. 1(a)
shows the three relations Papers, Authors, and Writes. In the following, we use
the initial of each relation name (P, A, and W) as its shorthand. There are two
foreign key references: $W \rightarrow A$ and $W \rightarrow P$. Fig. 1(b)
illustrates the tuple connections based on the foreign key references. For the
keyword query "James P2P" consisting of the two keywords "James" and "P2P",
there are six tuples in the database that contain at least one of the two
keywords (underlined in Fig. 1(a)). They can be regarded as results of the
query. However, they can be joined with other tuples according to the foreign
key references to form more meaningful results, several of which are shown in
Fig. 1(c). The arrows represent the foreign key references between the
corresponding pairs of tuples. Finding such results, which are formed by the
tuples containing the keywords, is the task of keyword search in relational
databases. As described later, results are often ranked by relevance scores
evaluated by a certain ranking strategy. □

Fig. 1. A sample database with a keyword query "James P2P". Panel (c): examples
of query results.



Most of the existing keyword search methods assume that the databases are static
and focus on answering snapshot keyword queries. In practice, however, a
database is often updated frequently, and the result of a snapshot query becomes
invalid once the related data in the database is updated. For the database in
Fig. 1, if publication data arrive continually, new publication records are
inserted into the three tables. Such new records may be more relevant to "James"
and "P2P". Hence, after getting the initial top-k results, the user may demand
that the top-k results reflect the latest database updates.



Scalable Continual Top-k Keyword Search in Relational Databases 






Such demands are common in real applications. Suppose a user wants to do a top-k
keyword search in a micro-blogging database, which is being updated continually:
not only are the weblogs and comments continually being inserted or deleted by
bloggers, but the follow relationships between bloggers are also being updated
continually. Thus, a continual evaluation facility for keyword queries is
essential in such databases.

For continual keyword query evaluation, when the database is updated, two situa- 
tions must be considered: 

1. Database updates may change the existing top-k results: some top-k results may be
replaced by new ones that are related to the new tuples, and some top-k results may
become invalid due to deletions.

2. Database updates may change the relevance scores of existing results because the 
underlying statistics (e.g., word frequencies) are changed. 

In this paper, we describe a system which can efficiently report the top-k results of 
every monitoring query while the database is being updated continually. The outline of 
the system is as follows: 

- When a continual query is issued, it is evaluated in a pipelined way to find the set
of results whose upper bounds of relevance scores are higher than a threshold $\theta$ by
calculating the upper bound of the future relevance score for every query result.

- When the database is updated, we first update the relevance scores of the computed
results, then find the new results whose upper bounds of relevance scores are larger
than $\theta$ and delete the results containing the deleted tuples.

- The pipelined evaluation of the keyword query is resumed if the number of computed
results whose relevance scores are larger than $\theta$ falls below $k$, or is reversed
if the above number is much larger than $k$.

- At any time, the $k$ computed results whose relevance scores are the largest and are
larger than $\theta$ are reported as the top-k results.

In Section 2, some basic concepts are introduced and the problem is defined.
Section 3 discusses related work. Section 4 presents the details of the proposed
method. Section 5 gives the experimental results. Conclusions are drawn in
Section 6.

2 Preliminaries 

In this section, we introduce some important concepts for top-k keyword query
evaluation in relational databases.

2.1 Relational Database Model 

We consider a relational database schema as a directed graph $G_S(V, E)$, called a
schema graph, where $V$ represents the set of relation schemas $\{R_1, R_2, \ldots\}$
and $E$ represents the foreign key references between pairs of relation schemas.
Given two relation schemas $R_i$ and $R_j$, there exists an edge in the schema graph
from $R_j$ to $R_i$, denoted $R_i \leftarrow R_j$, if the primary key of $R_i$ is
referenced by the foreign key defined on $R_j$. For example, the schema graph of the
publication database in Fig. 1 is $Papers \leftarrow Writes \rightarrow Authors$.
A relation on relation schema $R_i$ is an instance of $R_i$ (a set of tuples)
conforming to the schema, denoted $r(R_i)$. A tuple can be inserted into a relation.
Below, we use $R_i$ to denote $r(R_i)$ if the context is obvious.

2.2 Joint-Tuple-Trees (JTTs)

The results of keyword queries in relational databases are a set of connected trees of
tuples, each of which is called a joint-tuple-tree (JTT for short). A JTT represents how
the matched tuples, which contain the specified keywords in their text attributes, are
interconnected through foreign key references. Two adjacent tuples of a JTT,
$t_i \in r(R_i)$ and $t_j \in r(R_j)$, are interconnected if they can be joined based on
a foreign key reference defined on relation schemas $R_i$ and $R_j$ in $G_S$ (either
$R_i \leftarrow R_j$ or $R_i \rightarrow R_j$). The foreign key references between
tuples in a JTT can be denoted using arrows or the notation $\bowtie$. For example, the
second JTT in Fig. 1(c) can be denoted as $a_1 \leftarrow w_1 \rightarrow p_2$ or
$a_1 \bowtie w_1 \bowtie p_2$. To be a valid result of a keyword query $Q$, each leaf of
a JTT is required to contain at least one keyword of $Q$. In Fig. 1(c), tuples $p_1$,
$p_2$, $a_1$ and $a_3$ are matched tuples to the keyword query as they contain the
keywords. Hence, the four JTTs are valid results to the query. In contrast,
$p_1 \leftarrow w_2 \rightarrow a_2$ is not a valid result because tuple $a_2$ does not
contain any required keywords. The number of tuples in a JTT $T$ is called the size of
$T$, denoted by $size(T)$.
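
The leaf condition above can be sketched in a few lines. The function name `is_valid_jtt` and the node/edge-list representation are illustrative assumptions, not part of the paper, and the sketch checks only the leaf condition (it does not verify that the edges form a tree or that the tuples actually join).

```python
def is_valid_jtt(nodes, edges, matched):
    """Check the leaf condition of a joint-tuple-tree (JTT).

    nodes:   tuple ids in the tree
    edges:   (tuple_id, tuple_id) pairs connected by a foreign key reference
    matched: ids of tuples that contain at least one query keyword
    """
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # A leaf has at most one neighbour; a single-tuple JTT is its own leaf.
    leaves = [n for n, d in degree.items() if d <= 1]
    return all(n in matched for n in leaves)


matched = {"p1", "p2", "a1", "a3"}  # matched tuples named in the text

# a1 <- w1 -> p2: both leaves are matched tuples
valid = is_valid_jtt(["a1", "w1", "p2"], [("w1", "a1"), ("w1", "p2")], matched)
# p1 <- w2 -> a2: leaf a2 contains no keyword
invalid = is_valid_jtt(["p1", "w2", "a2"], [("w2", "p1"), ("w2", "a2")], matched)
print(valid, invalid)
```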

2.3 Candidate Networks (CNs) 

Given a keyword query $Q$, the query tuple set $R_i^Q$ of relation $R_i$ is defined as
$R_i^Q = \{t \in r(R_i) \mid t \text{ contains some keywords of } Q\}$. For example, the
two query tuple sets in Example 1 are $P^Q = \{p_1, p_2, p_5\}$ and
$A^Q = \{a_1, a_3, a_5\}$, respectively. The free tuple set $R_i^F$ of a relation $R_i$
with respect to $Q$ is defined as the set of tuples that do not contain any keywords of
$Q$. In Example 1, $P^F = \{p_3, p_4, \ldots\}$ and $A^F = \{a_2, a_4, \ldots\}$. If a
relation $R_i$ does not contain text attributes (e.g., relation $W$ in Fig. 1), $R_i$ is
used to denote $R_i^F$ for any keyword query. We use $R_i^{Q \text{ or } F}$ to denote a
tuple set, which may be either $R_i^Q$ or $R_i^F$.
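
As a rough illustration of how a relation splits into its query tuple set and free tuple set, consider the sketch below. The function name `split_tuple_sets`, the whitespace tokenization, and the sample titles are assumptions made for the example; the paper itself uses full-text indices for this step.

```python
def split_tuple_sets(relation, keywords):
    """Partition a relation into its query tuple set and free tuple set.

    relation: {tuple_id: text of the tuple's text attributes}
    keywords: iterable of query keywords
    Matching is a case-insensitive whole-word test, which is a simplification.
    """
    wanted = {k.lower() for k in keywords}
    query_set, free_set = [], []
    for tid, text in relation.items():
        words = set(text.lower().split())
        (query_set if words & wanted else free_set).append(tid)
    return query_set, free_set


# Hypothetical paper titles, not the contents of Fig. 1.
papers = {
    "x1": "Indexing P2P networks",
    "x2": "Query optimization basics",
    "x3": "Caching in P2P systems",
}
print(split_tuple_sets(papers, ["James", "P2P"]))
```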

Each JTT belongs to the result of a relational algebra expression, which is called a 
candidate network (CN) II4I9I1 II . A CN is obtained by replacing each tuple in a JTT 
with the corresponding tuple set that it belongs to. Hence, a CN corresponds to a join
expression on tuple sets that produces JTTs as results, where each join clause
$R_i^{Q \text{ or } F} \bowtie R_j^{Q \text{ or } F}$ corresponds to an edge
$(R_i, R_j)$ in the schema graph $G_S$, and $\bowtie$ represents an equi-join between
relations. For example, the CNs that correspond to the two JTTs $p_2$ and
$p_2 \leftarrow w_1 \rightarrow a_1$ in Example 1 are $P^Q$ and
$P^Q \bowtie W \bowtie A^Q$, respectively. In the following, we also denote
$P^Q \bowtie W \bowtie A^Q$ as $P^Q \leftarrow W \rightarrow A^Q$. As the leaf nodes of
JTTs must be matched tuples, the leaf nodes of CNs must be query tuple sets. Due to the
existence of m : n relationships (for example, an article may be written by multiple
authors), a CN may have multiple occurrences of the same tuple set. The size of a CN
$C$, denoted as $size(C)$, is defined as the number of tuple sets that it contains.
Obviously, the size of a CN is the same as that of the JTTs it produces. Fig. 2 shows
the CNs corresponding to the four JTTs shown in Fig. 1(c). A CN can be easily
transformed into an equivalent SQL statement and executed by an RDBMS.¹
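
The translation of a CN into SQL can be sketched as plain string assembly. The helper `cn_to_sql` and its argument layout are hypothetical; the column names `pid` and `aid` follow the paper's footnote example. The sketch interpolates ids directly into the string, so it is for illustration only and is not safe against untrusted input.

```python
def cn_to_sql(aliases, joins, query_sets):
    """Build a SQL string for one candidate network (CN).

    aliases:    {alias: relation name}, e.g. {"w": "W"}
    joins:      list of (left_column, right_column) equi-join conditions
    query_sets: {key column: ids of the tuples in that query tuple set}
    """
    from_clause = ", ".join(f"{rel} {alias}" for alias, rel in aliases.items())
    conditions = [f"{left} = {right}" for left, right in joins]
    for column, ids in query_sets.items():
        id_list = ", ".join(f"'{i}'" for i in ids)
        conditions.append(f"{column} IN ({id_list})")
    return f"SELECT * FROM {from_clause} WHERE " + " AND ".join(conditions)


# The CN P^Q <- W -> A^Q with the query tuple sets of Example 1.
sql = cn_to_sql(
    {"w": "W", "p": "P", "a": "A"},
    [("w.pid", "p.pid"), ("w.aid", "a.aid")],
    {"p.pid": ["p1", "p2", "p5"], "a.aid": ["a1", "a3", "a5"]},
)
print(sql)
```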




Fig. 2. Examples of Candidate Networks 



When a continual keyword query $Q = \{w_1, w_2, \ldots, w_l\}$ is specified, the
non-empty query tuple set $R_i^Q$ for each relation $R_i$ in the target database is
first computed using full-text indices. Then all the non-empty query tuple sets and the
database schema are used to generate the set of valid CNs. The basic idea is to expand
each partial CN by adding an $R_i^Q$ or $R_i^F$ at each step ($R_i$ is adjacent to one
relation of the partial CN in $G_S$), beginning from the set of non-empty query tuple
sets. The set of CNs shall be sound, complete and duplicate-free. There is always a
constraint, $CN_{max}$ (the maximum size of CNs), to avoid generating complicated but
less meaningful CNs. In the implementation, we adopt the state-of-the-art CN generation
algorithm proposed in [12].

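Example 2. In Example 1, there are two non-empty query tuple sets $P^Q$ and $A^Q$. Using
them and the database schema graph, if $CN_{max} = 5$, the generated CNs are:
$CN_1 = P^Q$, $CN_2 = A^Q$, $CN_3 = P^Q \leftarrow W \rightarrow A^Q$,
$CN_4 = P^Q \leftarrow W \rightarrow A^Q \leftarrow W \rightarrow P^Q$,
$CN_5 = P^Q \leftarrow W \rightarrow A^F \leftarrow W \rightarrow P^Q$,
$CN_6 = A^Q \leftarrow W \rightarrow P^Q \leftarrow W \rightarrow A^Q$ and
$CN_7 = A^Q \leftarrow W \rightarrow P^F \leftarrow W \rightarrow A^Q$.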



2.4 Scoring Method 

The problem of continual top-k keyword search we study in this paper is to continually
report top-k JTTs based on a certain scoring function that will be described below. We
adopt the scoring method employed in [4], which is an ordinary ranking strategy in the
information retrieval area. The following function $score(T, Q)$ is used to score JTT
$T$ for query $Q$, which is based on the TF-IDF weighting scheme:

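$$score(T, Q) = \frac{\sum_{t \in T} tscore(t, Q)}{size(T)}, \tag{1}$$

where $t \in T$ is a tuple (a node) contained in $T$. $tscore(t, Q)$ is the tuple score
of $t$ with regard to $Q$, defined as follows:

$$tscore(t, Q) = \sum_{w \in t \cap Q} \frac{1 + \ln(1 + \ln(tf_{t,w}))}{(1 - s) + s \cdot \frac{dl_t}{avdl}} \cdot \ln\left(\frac{N}{df_w + 1}\right), \tag{2}$$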

where $tf_{t,w}$ is the term frequency of keyword $w$ in tuple $t$, and $df_w$ is the
number of tuples in relation $r(t)$ (the relation that tuple $t$ belongs to) that
contain $w$. $df_w$ is interpreted as the document frequency of $w$. $dl_t$ represents
the size of tuple $t$, i.e., the number of letters in $t$, and is interpreted as the
document length of $t$. $N$ is the total number of tuples in $r(t)$, $avdl$ is the
average tuple size (average document length) in $r(t)$, and $s$ ($0 \le s \le 1$) is a
constant which is usually set to 0.2.

¹ For example, we can transform the CN $P^Q \leftarrow W \rightarrow A^Q$ as:
SELECT * FROM W w, P p, A a WHERE w.pid = p.pid AND w.aid = a.aid
AND p.pid IN (p1, p2, p5) AND a.aid IN (a1, a3, a5).
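
The two scoring functions can be sketched as follows. This is a minimal illustration that assumes the inverse document frequency term has the form $\ln(N / (df_w + 1))$; the function names and argument layout are hypothetical. The last line reproduces the arithmetic behind the score quoted for $p_2 \leftarrow w_1 \rightarrow a_1$ in the running example, where $w_1$ has no text attributes and therefore contributes a tuple score of 0.

```python
import math


def tscore(tf, df, dl, avdl, n, s=0.2):
    """Tuple score in the style of Eq. (2).

    tf:   {keyword: term frequency in the tuple}, only query keywords present in it
    df:   {keyword: number of tuples in the relation containing the keyword}
    dl:   size of the tuple; avdl: average tuple size; n: number of tuples
    The idf form ln(n / (df + 1)) is an assumption of this sketch.
    """
    total = 0.0
    for w, f in tf.items():
        ntf = 1.0 + math.log(1.0 + math.log(f))      # normalized term frequency
        ndl = (1.0 - s) + s * dl / avdl              # length normalization
        idf = math.log(n / (df[w] + 1))              # inverse document frequency
        total += ntf / ndl * idf
    return total


def jtt_score(tuple_scores):
    """Eq. (1): sum of the tuple scores divided by the JTT size."""
    return sum(tuple_scores) / len(tuple_scores)


# p2 <- w1 -> a1 with the tuple scores quoted in the text (7.04 and 4.00).
print(round(jtt_score([7.04, 0.0, 4.00]), 2))
```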

Table 1 shows the tuple scores of the six matched tuples in Example 1. We suppose
all the matched tuples are shown in Fig. 1, and the numbers of tuples of the two
relations are 150 and 180, respectively. Therefore, the top-3 results are $T_1 = p_2$
(score = 7.04), $T_2 = a_1$ (score = 4.00) and $T_3 = p_2 \leftarrow w_1 \rightarrow a_1$
(score = 3.68).

Table 1. Statistics and tuple scores of tuples of $P^Q$ and $A^Q$

The score function in Eq. (1) has the property of tuple monotonicity, defined as
follows. For any two JTTs $T = t_1 \bowtie t_2 \bowtie \ldots \bowtie t_l$ and
$T' = t'_1 \bowtie t'_2 \bowtie \ldots \bowtie t'_l$ generated from the same CN $C$, if
for any $1 \le i \le l$, $tscore(t_i, Q) \le tscore(t'_i, Q)$, then we have
$score(T, Q) \le score(T', Q)$. As shown in the following discussion, this property is
relied on by the existing top-k query evaluation algorithms.

3 Related Work 

3.1 Keyword Search in Relational Databases 

Given an $l$-keyword query $Q = \{w_1, w_2, \ldots, w_l\}$, the task of keyword search in a relational
database is to find structural information constructed from tuples in the database [13].
There are two approaches. The schema-based approaches [1,2,4,7,9,14,15] in this area
utilize the database schema to generate SQL queries which are evaluated to find the
structures for a keyword query. They process a keyword query in two steps. They first
utilize the database schema to generate a set of relation join templates (i.e., the CNs),
which can be interpreted as select-project-join views. Then, these join templates are
evaluated by sending the corresponding SQL statements to the DBMS to find the
query results. [2] shows how to generate a complete set of CNs when $CN_{max}$ has
a user-given value and discusses several query processing strategies that consider
the common sub-expressions among the CNs. [1,2,14,15] all focus on finding all
JTTs whose sizes are $\le CN_{max}$ and which contain all $l$ keywords, and there is no
ranking involved. In [4] and [9], several algorithms are proposed to get top-k JTTs. We
will introduce them in detail in Section 3.2.

The graph-based methods [3,8,5,6,10,16] model and materialize the entire database
as a directed graph where the nodes are relational tuples and the directed edges are
foreign key references between tuples. Fig. 1(b) shows such a database graph of the
example database. Then for each keyword query, they find a set of structures (either
Steiner trees [3], distinct rooted trees [5], r-radius Steiner graphs [10], or
multi-center subgraphs [16]) from the database graph, which contain all the query
keywords and are connected by the paths in the database graph. Such results are found by
graph traversals that start from the nodes that contain the keywords. For the details,
please refer to the survey papers [13,17]. The materialized data graph should be updated
for any database changes; hence this model is not appropriate for databases that change
frequently [17]. Therefore, this paper adopts the schema-based framework and can be
regarded as an extension for dealing with continual keyword search.



3.2 Top-k Keyword Search in Relational Databases

DISCOVER2 [4] proposed the Global-Pipelined (GP) algorithm to get the top-k results
which are ranked by the IR-style ranking strategy shown in Section 2.4. The aim of the
algorithm is to find a proper order of generating JTTs in order to stop early before all
the JTTs are generated. It employs the priority preemptive, round robin protocol [18] to
find results from each query tuple set prefix in a pipelined way, thus each CN can avoid
being fully evaluated.

For a keyword query $Q$, given a CN $C$, let the set of query tuple sets of $C$ be
$\{R_1^Q, R_2^Q, \ldots, R_m^Q\}$. Tuples in each $R_i^Q$ are sorted in non-increasing
order of their scores computed by Eq. (2). Let $R_i^Q.t_j$ be the $j$-th tuple in
$R_i^Q$. In each $R_i^Q$, we use $R_i^Q.cur$ to denote the current tuple such that the
tuples before the position of the tuple are all processed, and we use
$R_i^Q.cur \leftarrow R_i^Q.cur + 1$ to move $R_i^Q.cur$ to the next position.
$q(t_1, t_2, \ldots, t_m)$ (where $t_i$ is a tuple, and $t_i \in R_i^Q$) denotes the
parameterized query which checks whether the $m$ tuples can form a valid JTT. For each
tuple $R_i^Q.t_j$, we use $\overline{score}(C.R_i^Q.t_j, Q)$ to denote the upper bound
score for all the JTTs of $C$ that contain the tuple $R_i^Q.t_j$, defined as follows:

$$\overline{score}(C.R_i^Q.t_j, Q) = \frac{C.R_i^Q.t_j.tscore + \sum_{1 \le i' \le m,\, i' \ne i} C.R_{i'}^Q.t_1.tscore}{size(C)}. \tag{3}$$

According to the tuple monotonicity property of Eq. (1) and the sorting order of tuples,
among the unprocessed tuples of $C.R_i^Q$, $\overline{score}(C.R_i^Q.cur, Q)$ has the
maximum value.

Algorithm GP initially marks all tuples in $C.R_i^Q$ ($1 \le i \le m$) of each CN $C$ as
un-processed except for the top-most ones. Then in each while iteration (one round),
the un-processed tuple which maximizes the $\overline{score}$ value is selected for
processing. Suppose tuple $C_0.R_s^Q.cur$ maximizes $\overline{score}$; processing
$C_0.R_s^Q.cur$ is done by joining it with the processed tuples in the other query tuple
sets of $C_0$ to find valid JTTs: all the combinations
$(t_1, t_2, \ldots, t_{s-1}, R_s^Q.cur, t_{s+1}, \ldots, t_m)$ are tested, where $t_i$ is
a processed tuple of $C_0.R_i^Q$ ($1 \le i \le m$, $i \ne s$). If the $k$-th relevance
score of the found results is larger than the $\overline{score}$ values of all the
un-processed tuples in all the CNs, the algorithm can stop and output the $k$ found
results with the largest relevance scores because no results with higher scores can be
found in the further evaluation.
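
The round structure of GP described above can be sketched as follows. This is a simplified illustration, not the DISCOVER2 implementation: each CN is a dict with the hypothetical keys `sets` (query tuple sets sorted by tuple score), `size`, and `joins` (a callable standing in for the parameterized query $q(t_1, \ldots, t_m)$), every tuple set is assumed to be non-empty, and the sample data are invented.

```python
from itertools import product


def bound(cn, i, pos):
    """Upper bound in the style of Eq. (3) for the tuple at position pos of set i."""
    sets = cn["sets"]
    others = sum(s[0][1] for j, s in enumerate(sets) if j != i)
    return (sets[i][pos][1] + others) / cn["size"]


def global_pipelined(cns, k):
    found = []  # (score, tuple ids) of every valid JTT seen so far

    def check(cn, combo):
        if cn["joins"]([tid for tid, _ in combo]):
            score = sum(sc for _, sc in combo) / cn["size"]
            found.append((score, tuple(tid for tid, _ in combo)))

    cur = []  # cur[c][i]: position of the first un-processed tuple in set i of CN c
    for cn in cns:
        check(cn, [s[0] for s in cn["sets"]])  # the top-most tuples start as processed
        cur.append([1] * len(cn["sets"]))
    while True:
        cands = [(bound(cn, i, cur[c][i]), c, i)
                 for c, cn in enumerate(cns)
                 for i, s in enumerate(cn["sets"]) if cur[c][i] < len(s)]
        if not cands:
            break
        best, c, i = max(cands)
        found.sort(reverse=True)
        if len(found) >= k and found[k - 1][0] > best:
            break  # no un-processed tuple can beat the current k-th result
        cn = cns[c]
        # Join the selected tuple with the processed prefixes of the other sets.
        prefixes = [[s[cur[c][i]]] if j == i else s[:cur[c][j]]
                    for j, s in enumerate(cn["sets"])]
        for combo in product(*prefixes):
            check(cn, combo)
        cur[c][i] += 1
    found.sort(reverse=True)
    return found[:k]


# Invented data: X and Y are query tuple sets; the second CN has one free tuple set.
X = [("x1", 6.0), ("x2", 3.0)]
Y = [("y1", 4.0), ("y2", 1.0)]
cns = [
    {"sets": [X], "size": 1, "joins": lambda ids: True},
    {"sets": [X, Y], "size": 3,
     "joins": lambda ids: tuple(ids) in {("x1", "y1"), ("x2", "y2")}},
]
print(global_pipelined(cns, 2))
```

With these data the loop stops in its first round, because the second-best found score already exceeds every remaining upper bound.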

One drawback of the GP algorithm is that when a new tuple $C.R_s^Q.cur$ is processed, it
tries all the combinations of processed tuples
$(t_1, t_2, \ldots, t_{s-1}, t_{s+1}, \ldots, t_m)$ to test whether each combination can
be joined with $C.R_s^Q.cur$. This operation is costly due to the extremely large number
of combinations when the number of processed tuples becomes large [19].
SPARK [9] proposes the Skyline-Sweeping algorithm to reduce the number of tested
combinations. SPARK uses a priority queue to keep the set of seen but not tested
combinations ordered by the priority defined as the score of the hypothetical JTT
corresponding to each combination. In each round, the combination in the queue with the
maximum priority is tested, then all its adjacent combinations are inserted into the
queue, but only the combinations that have high priorities are tested. SPARK still
cannot avoid testing a huge number of combinations which cannot produce results, though
the number of tested combinations is highly reduced compared to DISCOVER2.

This paper evaluates the CNs in a pipelined way like [4] and [9], but also employs
the following two optimization strategies, whose high efficiencies are shown in
[2,14,15]: (1) sharing the computational cost among CNs; and (2) adopting tuple
reduction.

3.3 Keyword Search in Relational Data Streams 

The projects most related to our paper are S-KWS [14] and KDynamic [20,15], which
try to find new results or expired results for a given keyword query over an open-ended,
high-speed large relational data stream [13]. They adopt the schema-based framework
since the database is not static. This paper deals with a different problem from S-KWS
and KDynamic, though all need to respond to continual queries in a dynamic environment.
S-KWS and KDynamic focus on finding all query results. On the contrary, our
methods maintain the top-k results, which are less sensitive to the updates of the
underlying databases because not every new or expired result changes the top-k results.

S-KWS maps each CN to a left-deep operator tree, where leaf operators (nodes) are 
tuple sets, and interior operators are joins. Then the operator trees of all the CNs are 
compacted into an operator mesh by collapsing their common subtrees. Joins in the 
operator mesh are evaluated in a bottom-to-top manner. A join operator has two inputs 
and is associated with an output buffer which saves its results (partial JTTs). The output 
buffer of a join operator becomes input to many other join operators that share the join 
operator. A new result that is newly outputted by a join operator will be a new arrival 
input to those joins sharing it. The operator mesh has two main shortcomings [19]:
(1) only the left part of the operator trees can be shared; and (2) a large number of 
intermediate tuples, which are computed by many join operators in the mesh with high 
processing cost, will not be eventually output in the end. 

To overcome the above shortcomings of S-KWS, KDynamic formalizes each CN
as a rooted tree, whose root is defined to be the node $r$ such that the maximum path
from $r$ to all leaf nodes of the CN is minimized, and then compresses all the rooted
trees into an $\mathcal{L}$-Lattice by collapsing the common subtrees. Fig. 3(a) shows
the lattice of two hypothetical CNs. Each node $V$ in the lattice is also associated
with an output buffer, which contains the tuples in $V$ that can join at least one tuple
in the output buffer of each of its child nodes. Thus, each tuple in the output buffer
of each top-most node $V$, i.e., the root of a CN, can form JTTs with tuples in the
output buffers of its descendants. The new JTTs involving a new tuple are found in a
two-phase approach. In the filter phase, as illustrated in Fig. 3(b), when a new tuple
$t_{new}$ is inserted into node $R_4$, KDynamic uses selections and semi-joins to check
if (1) $t_{new}$ can join at least one tuple in the output buffer of each child node of
$R_4$, and (2) $t_{new}$ can join at least one tuple in the output buffers of the
ancestors of $R_4$. The new tuples that cannot pass the checks are pruned; otherwise,
in the join phase (shown in Fig. 3(c)), a joining process is initiated from each tuple
in the output buffer of each root node that can join $t_{new}$, in a top-down manner, to
find the JTTs involving $t_{new}$.



Fig. 3. Query processing in KDynamic: (a) $\mathcal{L}$-Lattice of two CNs, (b) filter
phase, (c) join phase.



In this paper, we incorporate the ranking mechanisms and the pipelined evaluation
into the query processing method of KDynamic to support efficient top-k keyword
search in relational databases.



4 Continual Top-k Keyword Search in Relational Databases

4.1 Overview

Database updates bring two orthogonal effects on the current top-k results:

1. They change the values of $df_w$, $N$, and $avdl$ in Eq. (2) and hence change the
relevance scores of existing results.

2. New JTTs may be generated due to insertions. Existing top-k results may expire
due to deletions.

Although the second effect is more drastic, the first effect is not negligible for
long-term database modifications. Thus, we cannot neglect all the JTTs that are not the
current top-k results because some of them have the potential of becoming top-k results
in the future. This paper solves this problem by bounding the future relevance score of
each result. We use $score^u$ to denote the upper bound of the relevance score for each
result. Then, the results whose $score^u$ values are not larger than the relevance score
of the top-k-th result can be safely ignored.

The second challenge is the shortage of top-k results, because they can expire due to
deletions. Since the value $k$ is rather small compared to the huge number of all the
valid JTTs, the possibility of deleting a top-k result is rather small. In addition, new
top-k results can also be formed by new tuples. Thus, if the insertion rate is not much
smaller than the deletion rate, the possibility of a top-k results shortage would be
small. However, this possibility would be high if the deletion rate is much larger,
which can result in frequent top-k results refilling operations. It is worth noting that
the top-k results shortage can also be caused by changes in the relevance scores of
results. Our solution to this problem is to compute the top-$(k + \Delta k)$
($\Delta k > 0$) results instead of the necessary $k$, where $\Delta k$ is a margin
value. Then, we can tolerate up to $\Delta k$ deletions of top results when maintaining
the top-k results. The setting of $\Delta k$ is important. If $\Delta k$ is too small,
there is a high possibility of refilling. If $\Delta k$ is too large, the efficiency of
handling database modifications is decreased. Instead of analyzing the update behavior
of the underlying database to estimate an appropriate $\Delta k$ value, we enlarge
$\Delta k$ each time a top-k results shortage occurs until it reaches a value such that
the frequency of top-k results shortages falls below a threshold.

On the contrary, after maintaining the top-k results for a long time, the number of
computed top results may be larger than $(k + \Delta k)$, especially when the insertion
rate is high. In such cases, the efficiency of maintaining the top-k results is
decreased because we need to update the relevance scores for more results and join the
new tuples with more tuples than necessary. As shown in the experimental results, such
extra cost is not negligible for long-term database modifications. Therefore, we need to
reverse the pipelined query evaluation if there are too many computed top results.

In brief, when a continual keyword query is registered, we first generate the set
of CNs and compact them into a lattice $\mathcal{L}$. Then, the initial top-k results
are found by processing tuples in $\mathcal{L}$ in a pipelined way until the $score^u$
values of the un-seen JTTs are not larger than the relevance score of the
top-$(k + \Delta k)$-th result (which is denoted by $\mathcal{L}.\theta$). When
maintaining the top-k results, we only find the new results with
$score^u > \mathcal{L}.\theta$. The pipelined evaluation of $\mathcal{L}$ is resumed if
the number of found results with $score^u > \mathcal{L}.\theta$ falls below $k$, or is
reversed if the above number is larger than $(k + \Delta k)$. The method of computing
$score^u$ for results is introduced in Section 4.2. Section 4.3 and Section 4.4 describe
our method of computing the initial top-k results and maintaining the top-k results,
respectively. Then, two techniques which can highly improve the query processing
efficiency are presented in Section 4.5 and Section 4.6.
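
The resume/reverse rule in the overview can be written down as a small decision function. The function name, the string return values, and the sample numbers are illustrative assumptions; how the evaluation is actually resumed or reversed is described in the later sections of the paper.

```python
def maintenance_action(upper_bounds, theta, k, delta_k):
    """Decide how to adjust the pipelined evaluation after a database update.

    upper_bounds: score^u values of the currently computed results
    theta:        the threshold L.theta
    """
    above = sum(1 for u in upper_bounds if u > theta)
    if above < k:
        return "resume"   # too few candidates: continue the pipelined evaluation
    if above > k + delta_k:
        return "reverse"  # too many candidates: roll the evaluation back
    return "keep"


# Invented numbers: k = 2, margin of 1, threshold 3.0.
print(maintenance_action([5.0, 2.0], 3.0, 2, 1))
print(maintenance_action([5.0, 4.0, 3.5, 3.2], 3.0, 2, 1))
```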



4.2 Computing Upper Bound of Relevance Scores 

Let us recall the function for computing tuple scores given in Eq. (2):

$$tscore(t, Q) = \sum_{w \in t \cap Q} \frac{1 + \ln(1 + \ln(tf_{t,w}))}{(1 - s) + s \cdot \frac{dl_t}{avdl}} \cdot \ln\left(\frac{N}{df_w + 1}\right).$$

We assume that the future values of each $\ln\left(\frac{N}{df_w + 1}\right)$ and of
$avdl$ both have an upper bound, $\ln^u\left(\frac{N}{df_w + 1}\right)$ and $avdl^u$,
respectively. Then, we can derive the upper bound of the future tuple score for each
tuple $t$ as:

$$t.tscore^u = \sum_{w \in t \cap Q} \frac{1 + \ln(1 + \ln(tf_{t,w}))}{(1 - s) + s \cdot \frac{dl_t}{avdl^u}} \cdot \ln^u\left(\frac{N}{df_w + 1}\right). \tag{4}$$

Hence, the upper bound of the future relevance score of a JTT $T$ is:

$$T.score^u = \frac{1}{size(T)} \sum_{t \in T} t.tscore^u. \tag{5}$$

Note that the function in Eq. (5) also has the tuple monotonicity property on $tscore^u$.

On query registration, each $\ln^u\left(\frac{N}{df_w + 1}\right)$ is computed from the
current value of $\ln\left(\frac{N}{df_w + 1}\right)$ enlarged by a margin controlled by
$\Delta df_w$, and each $avdl^u$ is computed as $avdl \cdot (1 + \Delta avdl)$, where
$\Delta df_w$ and $\Delta avdl$ are both set to small values (about 1%). When
maintaining the top-k results, we continually monitor the change of statistics to
determine whether all the $\ln\left(\frac{N}{df_w + 1}\right)$ and $avdl$ values remain
below their upper bounds. Each time any $\ln\left(\frac{N}{df_w + 1}\right)$ or $avdl$
value exceeds its upper bound, the corresponding $\Delta df_w$ or $\Delta avdl$ is
enlarged until the frequency of exceeding the upper bounds falls below a small number.

Example 3. Table 2 shows the $tscore^u$ values of the six matched tuples in Example 1 by
setting $\Delta df_w = 20\%$ and $\Delta avdl = 10\%$. Hence, $T_1.score^u = 7.42$,
$T_2.score^u = 4.23$ and $T_3.score^u = 3.88$.

Table 2. Upper bounds of tuple scores

Tuple       a1     a3     a5     p1     p2     p5
tscore^u    4.23   3.64   3.60   3.52   7.42   3.57

4.3 Finding Initial Top-k Results

Fig. 4 shows the $\mathcal{L}$-lattice of the seven CNs in Example 2. We use $V_i$ to
denote a node in $\mathcal{L}$. Particularly, $V_i^Q$ denotes a lattice node of a query
tuple set, and $V_i^Q.R^Q$ denotes the query tuple set of $V_i^Q$. The dual edges
between two nodes, for instance, $V_i^Q$ and $V_j$, indicate that $V_j$ is a dual child
of $V_i^Q$. A node $V_i$ in $\mathcal{L}$ can belong to multiple CNs. We use $V_i.CN$ to
denote the set of CNs that node $V_i$ belongs to. For example, for one of the $A^Q$
nodes, $V_i^Q.CN = \{CN_2, CN_3, CN_6, CN_7\}$. Tuples in each query tuple set
$V_i^Q.R^Q$ are sorted in non-increasing order of $tscore^u$. We use $V_i^Q.cur$ to
denote the current tuple such that the tuples before the position of the tuple are all
processed, and we use $V_i^Q.cur \leftarrow V_i^Q.cur + 1$ to move $V_i^Q.cur$ to the
next position. Initially, for each node $V_i^Q$ in $\mathcal{L}$, $V_i^Q.cur$ is set as
the top tuple in $V_i^Q.R^Q$. In Fig. 4, $V_i^Q.cur$ of the four $V_i^Q$ nodes are
denoted by arrows. For a node $V_i$ that is of a free tuple set $R_i^F$, we regard all
the tuples of $R_i^F$ as its processed tuples at all times. We use $V_i.output$ to
indicate the output buffer of $V_i$, which contains its processed tuples that can join
at least one tuple in the output buffer of each child node of $V_i$. Tuples in
$V_i.output$ are also referred to as the outputted tuples of $V_i$.

Fig. 4. The constructed lattice of the seven CNs in Example 2

In order to find the top-k results in a pipelined way, we need to bound the $score^u$
values of the un-found results. For each tuple $t_j$ of $V_i^Q.R^Q$, the maximal
$score^u$ value of the JTTs that $t_j$ can form is defined as follows:

$$score^u(V_i^Q, t_j, Q) = \begin{cases} 0, & \text{a child node of } V_i^Q \text{ has an empty output buffer}, \\ \max_{C \in V_i^Q.CN} score^u(C.R^Q.t_j, Q), & \text{otherwise}, \end{cases} \tag{6}$$

where $score^u(C.R^Q.t_j, Q)$ indicates the maximal $score^u$ for all the JTTs of $C$
that contain tuple $t_j$, and is obtained by replacing $tscore$ in Eq. (3) with
$tscore^u$. If a child of $V_i^Q$ has an empty output buffer, processing any tuple at
$V_i^Q$ cannot produce JTTs; hence $score^u(V_i^Q, t_j, Q) = 0$ in such cases, which can
choke the processing of tuples at $V_i^Q$ until all its child nodes have non-empty
output buffers. According to Eq. (6) and the sorting order of the tuples, among the
un-processed tuples of $V_i^Q.R^Q$, $score^u(V_i^Q, V_i^Q.cur, Q)$ has the maximum
value. We use $score^u(V_i^Q, Q)$ to denote $score^u(V_i^Q, V_i^Q.cur, Q)$. In Fig. 4,
the $score^u(V_i^Q, Q)$ values of the four $V_i^Q$ nodes are shown next to the arrows.
For example, for the $A^Q$ node with $V_i^Q.CN = \{CN_2, CN_3, CN_6, CN_7\}$,
$score^u(V_i^Q, Q) = \max_{C \in \{CN_2, CN_3, CN_6, CN_7\}} score^u(C.A^Q.a_1, Q) = 4.23$.
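
The node-level bound of Eq. (6) can be sketched as follows. The function names and the argument layout are hypothetical. The sample call re-derives the value 4.23 for tuple $a_1$, using the $tscore^u$ values quoted in Example 3 (7.42 for the top tuple of $P^Q$, 4.23 for the top tuple of $A^Q$) and the CN sizes from Example 2.

```python
def cn_upper_bound(t_bound, other_tops, size):
    """Eq. (3) with tscore^u: the tuple's bound plus the top bounds of the other
    query tuple sets of the CN, divided by the CN size."""
    return (t_bound + sum(other_tops)) / size


def node_upper_bound(t_bound, cns, child_output_sizes):
    """Eq. (6).

    cns:                list of (top bounds of the other query tuple sets, CN size)
    child_output_sizes: number of tuples in the output buffer of each child node
    """
    if any(n == 0 for n in child_output_sizes):
        return 0.0  # a child has an empty output buffer: no JTT can be produced yet
    return max(cn_upper_bound(t_bound, tops, size) for tops, size in cns)


# Tuple a1 at an A^Q leaf node (no children) that belongs to CN2, CN3, CN6 and CN7.
cns_of_node = [
    ([], 1),            # CN2 = A^Q
    ([7.42], 3),        # CN3 = P^Q <- W -> A^Q
    ([7.42, 4.23], 5),  # CN6 = A^Q <- W -> P^Q <- W -> A^Q
    ([4.23], 5),        # CN7 = A^Q <- W -> P^F <- W -> A^Q
]
print(node_upper_bound(4.23, cns_of_node, []))
```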

Algorithm 1 outlines our pipelined algorithm for evaluating the lattice L to find the initial top-k results, which is similar to the GP algorithm. Lines 1-3 are the initialization step, which sorts the tuples in each query tuple set and initializes each V_i^Q.cur. Then, in each while iteration (lines 4-8), the un-processed tuple among all the V_i^Q nodes that maximizes score^u is selected to be processed. Processing the selected tuple is done by calling the procedure Insert. Algorithm 1 stops when max_{V_i^Q ∈ L} score^u(V_i^Q, Q) is not larger than the relevance score of the (k + Δk)-th found result. The procedure Insert(V_i, t) is provided in KDynamic; it updates the output buffers of V_i (line 13) and all its ancestors (lines 17-18), and finds all the JTTs containing tuple t by calling the procedure EvalPath (line 16). We will explain procedure Insert using examples later. The recursive procedure EvalPath(V_i, t, path) is provided in KDynamic too; it constructs JTTs using the outputted tuples of V_i's descendants that can join t. The stack path, which records where the join sequence comes from, is used to reduce the join cost.
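The stopping rule of the pipelined loop can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the paper's implementation: each node carries a fixed, pre-sorted list of (tuple, bound) pairs, the callback `insert` is a hypothetical stand-in for the procedure Insert, and the dependence of the bounds on child output buffers is ignored. The numbers for p4 are made up.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[float, str]  # (relevance score, JTT label)


def eval_static_pipelined(
    nodes: Dict[str, List[Tuple[str, float]]],
    k: int,
    delta_k: int,
    insert: Callable[[str, str], List[Result]],
) -> Tuple[List[Result], float]:
    """Process tuples in order of their upper bounds until no bound can beat
    the (k + delta_k)-th found result; return the top-k results and theta."""
    cur = {name: 0 for name in nodes}  # index of the next un-processed tuple
    topk: List[Result] = []

    def bound(name: str) -> float:
        i = cur[name]
        return nodes[name][i][1] if i < len(nodes[name]) else float("-inf")

    def threshold() -> float:
        # Score of the (k + delta_k)-th result, or -inf while too few are known.
        n = k + delta_k
        return topk[n - 1][0] if len(topk) >= n else float("-inf")

    while True:
        best = max(nodes, key=bound)          # node with the largest bound
        if bound(best) <= threshold():
            break
        topk.extend(insert(best, nodes[best][cur[best]][0]))
        topk.sort(key=lambda r: r[0], reverse=True)
        cur[best] += 1
    return topk[:k], threshold()


scores = {"p2": 7.04, "a1": 4.00, "a3": 3.40, "p4": 0.90}
nodes = {"A": [("a1", 4.23), ("a3", 3.64)], "P": [("p2", 7.42), ("p4", 1.00)]}
results, theta = eval_static_pipelined(nodes, 2, 0, lambda node, t: [(scores[t], t)])
print(results, theta)
```

In this toy run the loop stops after two tuples, because the best remaining bound (3.64) cannot exceed the second found score (4.00).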

Example 4. In the first round, tuple V^.pz is processed by calling InsertiVg, p2). Since 
Vg is the root node of CNi , EvalPath is called and JTT - p2 is found. Then, for the 
two father nodes of V^, V(, and Vj, V^-output is not updated because vf .output = 0, 
Vj. output is updated to {w\,w-i] because p2 can join w\ and w-j. And then, for the two 
father nodes of Vi, and V4, V^. output is not updated since has no processed 
Algorithm 1: EvalStatic-Pipelined(lattice L, the top-k value k, Δk)

 1  topk ← ∅;  // the priority queue for storing found JTTs ordered by score
 2  Sort tuples of each V_i^Q.R^Q in non-increasing order of tscore^u;
 3  foreach node V_i^Q in L do let V_i^Q.cur ← V_i^Q.R^Q.t_1;
 4  while max_{V_i^Q ∈ L} score^u(V_i^Q, Q) > topk[k + Δk].score do
 5      Suppose score^u(V_m^Q, Q) = max_{V_i^Q ∈ L} score^u(V_i^Q, Q);
 6      path ← ∅;  // A stack which records the join sequence
 7      Insert(V_m^Q, V_m^Q.cur);  // Processing tuple V_m^Q.cur at V_m^Q
 8      V_m^Q.cur ← V_m^Q.cur + 1;
 9  Output the first k results in topk;
10  L.θ ← topk[k + Δk].score;

11  Procedure Insert(lattice node V_i, tuple t)
12      if t ∉ V_i.output and t can join at least one outputted tuple of every child of V_i then
13          Insert t into V_i.output;
14      if t ∈ V_i.output then
15          Push (V_i, t) to path;
16          if V_i is a root node then topk ← topk ∪ EvalPath(V_i, t, path);
17          foreach father node V_i' of V_i in L do
18              foreach tuple t' belonging to V_i' that can join t do Insert(V_i', t');
19          Pop (V_i, t) from path;

20  Procedure EvalPath(lattice node V_i, tuple t, stack path)
21      T ← {t};  // The set of found JTTs
22      foreach child node V_i' of V_i in L do
23          T' ← ∅;  // The set of JTTs that are rooted at tuples of node V_i'
24          if V_i' ∈ path then
25              let t' be the tuple of node V_i' that is stored in path;
26              T' ← EvalPath(V_i', t', path);
27          else
28              foreach tuple t' ∈ V_i'.output that joins t do
29                  T' ← T' ∪ EvalPath(V_i', t', path);  // Union the JTTs that are rooted at different tuples of V_i'
30          T ← T × T';  // Compute the Cartesian product
31      return T;
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tuples, Vn.output is set as {02) because there is only one tuple 02 in that can join 
Wi and w-i. Since V4 is the root node (of CN^), EualPathiV^, 02, path) is called but no 
results are found because the only one found JTT p2 ^ w-j ^ 02 ^ wj ^ p2 is not a 
valid result. After processing tuple V^.p2, score" (vf, - 3.82 and score" (v^, = 

3.57. In the second round, tuple Vg.ai is processed, which finds results T2 - a\ and 
7^3 = P2 ^ — » a\. Then, V2-output — {/?4), V^.output — {wi,W4}, Vd.output — [w\}, 
score" (yp, Q) = 3.18, and score" (vf , g) = score" {CNj,.AQ .ai,Q) = 3.69. hi the 

third-fifth rounds, tuples vf.ai, V^.a^ and V^.a^ are processed, which insert ai into 
vf .output and no results found. In the sixth round, tuple V^.a^ is processed, which 
finds results T_4 = a_3 and T_5 = a_1 ← w_4 → p_4 ← w_6 → a_3. Then, Algorithm 1 stops because the relevance score of the third result in the queue topk (suppose Δk = 0) is larger than all the score^u(V_i^Q, Q) values. Fig. 5 shows the snapshot of L after finding the top-3 results. Thus, L.θ = 3.68 after the evaluation.





topk

JTT                                  score (score^u)
p_2                                  7.04 (7.42)
a_1                                  4.00 (4.23)
p_2 ← w_1 → a_1                      3.68 (3.89)
a_3                                  3.40 (3.64)
a_1 ← w_4 → p_4 ← w_6 → a_3          1.48 (1.57)



Fig. 5. After finding the top-3 results (tuples in the output buffers are shown in bold) 

After the execution of Algorithm 1, the score^u values of all the un-found results are not larger than L.θ. Results in the queue topk can be categorized into three kinds. The first kind are the (k + Δk) results with score ≥ L.θ, which are the initial top-(k + Δk) results. The second kind are those with score < L.θ and score^u > L.θ, which are called the potential top-(k + Δk) results because they have the potential to become top-(k + Δk) results. The third kind are those with score^u ≤ L.θ. As shown in the experiments, the results of the last kind may be numerous. However, we cannot discard them because some of them may move into the first two kinds when maintaining the top-k results.
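The three kinds of results can be told apart with two comparisons against the threshold. The sketch below is illustrative only: it assumes the first kind is delimited by score ≥ θ, the function name `classify_results` and the kind labels are hypothetical, and the result "x" is an invented entry added to show a potential result. The other values are those listed for the topk queue in Fig. 5.

```python
from typing import Dict, List, Tuple


def classify_results(results: List[Tuple[str, float, float]],
                     theta: float) -> Dict[str, List[str]]:
    """Split (label, score, score_u) triples into the three kinds of results."""
    kinds: Dict[str, List[str]] = {"top": [], "potential": [], "other": []}
    for label, score, score_u in results:
        if score >= theta:
            kinds["top"].append(label)        # initial top-(k + delta_k) results
        elif score_u > theta:
            kinds["potential"].append(label)  # may enter the top results later
        else:
            kinds["other"].append(label)      # kept, but not updated eagerly
    return kinds


queue = [
    ("p2", 7.04, 7.42),
    ("a1", 4.00, 4.23),
    ("p2<-w1->a1", 3.68, 3.89),
    ("x", 3.50, 3.90),                        # hypothetical result
    ("a3", 3.40, 3.64),
    ("a1<-w4->p4<-w6->a3", 1.48, 1.57),
]
print(classify_results(queue, theta=3.68))
```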



4.4 Maintaining Top-k Results

Algorithm 2 shows our algorithm for maintaining the top-k results. A database update operator is denoted by OP(t, R_t), which represents that a tuple t of relation R_t is inserted (if OP is an insertion) or deleted (if OP is a deletion). Note that a database update is modeled as a deletion followed by an insertion. For a newly arrived OP(t, R_t), Algorithm 2 first checks whether the ln(N/(df_w + 1)) and avdl values of relation R_t exceed their upper bounds. If some ln(N/(df_w + 1)) or avdl exceeds its upper bound, we enlarge the corresponding Δdf_w or Δavdl (line 3; the methods of enlarging Δdf_w, Δavdl and Δk are introduced in detail in the experiments), and then update the score and score^u values for all the tuples in R_t^Q and all the results in the queue topk using the enlarged bound (line 4); otherwise, we update the relevance scores of the results in topk that have score^u > L.θ (line 6). Then, we insert t into L to find the new results if OP is an insertion (lines 7-13), or delete the expired JTTs and t from L if OP is a deletion (lines 14-17). Lines 7-17 are explained in detail later. After that, the score^u(V_i^Q, Q) of some nodes may be larger than L.θ, which can be caused by three reasons: (1) the upper bound scores of tuples of relation R_t are increased; (2) the score^u(V_i^Q, Q) of some nodes is increased from 0 after inserting the new tuple into L; and (3) new CNs are added into L. Therefore, in lines 18-19 we process tuples using procedure Insert until all the score^u(V_i^Q, Q) values are not larger than L.θ.



Algorithm 2: Maintain(the evaluated lattice L, the top-k value k, Δk)

 1  while a new database modification OP(t, R_t) arrives do
 2      if some ln(N/(df_w + 1)) (or avdl) exceed their upper bounds after applying OP then
 3          Enlarge the corresponding Δdf_w (or Δavdl) value(s);
 4          Update relevance scores for tuples in R_t^Q and results in topk;
 5      else
 6          Update score for results in topk that are with score^u > L.θ;
 7      if OP is an insertion then  // Insert t into L
 8          if t is an un-matched tuple then
 9              foreach node V_i in L that is of the free tuple set of R_t do Insert(V_i, t);
10          else
11              if R_t^Q is new then add the new CNs into L;
12              Insert t into R_t^Q in descending order of tscore^u;
13              foreach V_i^Q that is of R_t^Q and has score^u(V_i^Q, t, Q) > L.θ do Insert(V_i^Q, t);
14      else if OP is a deletion then  // Delete t from L
15          Delete the results that contain t and are with score^u > L.θ from topk;
16          if t is a matched tuple then remove t from R_t^Q;
17          foreach node V_i in L such that t ∈ V_i.output do Delete(V_i, t);
18      while max_{V_i^Q ∈ L} score^u(V_i^Q, Q) > L.θ do
19          foreach node V_i^Q that is with score^u(V_i^Q, Q) > L.θ do Insert(V_i^Q, V_i^Q.cur);
20      if |{T | T ∈ topk, T.score ≥ L.θ}| < k then  // Resume the evaluation of L
21          Enlarge Δk and then resume the execution of EvalStatic-Pipelined;
22      else if |{T | T ∈ topk, T.score ≥ L.θ}| > (k + Δk) then
23          RollBack(L, k, Δk);  // Reverse the evaluation of L
24      Report the new first k results in topk if they are changed;

25  Procedure Delete(V_i, t)
26      Delete t from V_i.output;
27      foreach father node V_i' of V_i in L do
28          foreach tuple t' in V_i'.output that can join t only do
29              Delete(V_i', t');  // Call Delete recursively
Finally, in lines 20-23 we count the number of results with score ≥ L.θ. If the number is smaller than k, Δk is enlarged, and then the EvalStatic-Pipelined algorithm (without the initialization step) is called to further evaluate L. If the number is larger than k + Δk, the algorithm RollBack, which is described at the end of this subsection, is called to roll back the evaluation of L. In any case, at the end of handling the OP, we have max_{V_i^Q ∈ L} score^u(V_i^Q, Q) ≤ topk[k].score. Therefore, the k results in topk that have the largest relevance scores are the top-k results. We do not process the results in topk with score^u ≤ L.θ in line 6 and line 15 because they can be numerous and do not have the potential to become top-k results. However, after the execution of lines 4 and 21, the score^u of some of them may become larger than L.θ, because their score^u values may be enlarged in line 4 and L.θ may be decreased in line 21. Therefore, all the results in topk need to be considered in lines 4 and 21. Note that we first have to check whether some of them have expired due to deletions.

In lines 7-13, the new tuple t is processed differently according to whether it contains the keywords. If t is an un-matched tuple, it is inserted into each node of the free tuple set of R_t using the procedure Insert (line 9). If t is a matched tuple, inserting it into L is more complicated. First, if t introduces a new non-empty query tuple set R_t^Q, we add the new CNs involving R_t^Q into the lattice. Fig. 6 illustrates the process of inserting a new CN into the lattice shown in Fig. 5. Assuming that W - P^Q is the largest common subtree of the new CN and L, that V_j is its root in L, and that V_i is the father node of W - P^Q in the new CN, the new CN is added by setting V_j as the child of V_i. If V_i is a free tuple set and does not have other child nodes, as shown in Fig. 6, Insert(V_i, t') is called for each tuple t' of V_i that can join tuples in V_j.output. Further evaluation at the nodes of the new CN, if necessary, will be done in lines 18-19. Second, t is added into the query tuple set R_t^Q (line 12), and then for each node V_i^Q of R_t^Q, Insert(V_i^Q, t) is called when score^u(V_i^Q, t, Q) > L.θ (line 13), i.e., when t has the potential to form JTTs with score^u > L.θ.



[Fig. 6 panels: rooted tree of a new CN; lattice]




Fig. 6. Inserting a new CN into the lattice 



If OP is a deletion, for each node V_i in L such that t ∈ V_i.output, we delete t from V_i.output using the procedure Delete, which is provided by KDynamic. Procedure Delete first removes t from V_i.output, and then checks whether some outputted tuples of the ancestors of V_i need to be removed (lines 27-29). For instance, if the tuple a_3 is

deleted from the lattice node V^ shown in Fig.pTtuples and W(, are deleted from 

— o 

Vi.output too because they can join 03 only, among tuples in Vi.output. 
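The cascading behaviour of procedure Delete can be sketched as follows. This is an illustrative sketch under stated assumptions: the `joins` dictionary is a hypothetical in-memory stand-in for the select operations, and the node names, their parent links and the tuples are invented for the example rather than taken from Fig. 5.

```python
from typing import Dict, List, Set


def delete(node: str, t: str,
           output: Dict[str, Set[str]],
           fathers: Dict[str, List[str]],
           joins: Dict[str, Set[str]]) -> None:
    """Remove t from node's output buffer and cascade the removal upwards.

    joins[x] is the set of tuples that x can join (hypothetical lookup table).
    """
    output[node].discard(t)
    for father in fathers.get(node, []):
        # Iterate over a snapshot because recursive calls shrink the buffer.
        for t2 in list(output[father]):
            partners = joins.get(t2, set())
            remaining = partners & output[node]
            # t2 joined t and has no other join partner left in node's buffer.
            if t in partners and not remaining:
                delete(father, t2, output, fathers, joins)


output = {"A": {"a1", "a3"}, "W": {"w4", "w5", "w6"}, "P": {"p4"}}
fathers = {"A": ["W"], "W": ["P"]}
joins = {"w4": {"a1"}, "w5": {"a3"}, "w6": {"a3"}, "p4": {"w4", "w6"}}
delete("A", "a3", output, fathers, joins)
print(output)
```

Here w5 and w6 lose their only join partner and are removed, while p4 survives because it can still join w4.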

Algorithm 3 outlines our algorithm to reverse the execution of the pipelined evaluation of the lattice. In the beginning, L.θ is set to the relevance score of the (k + Δk)-th result in the queue topk (line 1). Then, the processing of each processed tuple t ∈ V_i^Q.R^Q with score^u(V_i^Q, t, Q) < L.θ is reversed (lines 4-6). We use V_i^Q.cur − 1 to denote the tuple just before V_i^Q.cur. If t ∈ V_i^Q.output, the results involving t are first deleted from topk, and then t is deleted from V_i^Q.output by calling the procedure Delete.



Algorithm 3: RollBack(a lattice L, the top-k value k, Δk)

1  L.θ ← topk[k + Δk].score;
2  foreach node V_i^Q in L do
3      while score^u(V_i^Q, V_i^Q.cur − 1, Q) < L.θ do
4          if V_i^Q.cur − 1 ∈ V_i^Q.output then
5              Remove the results that are of CNs in V_i^Q.CN and contain tuple V_i^Q.cur − 1 from topk;
6              Delete(V_i^Q, V_i^Q.cur − 1);  // Delete from the output buffer
7          V_i^Q.cur ← V_i^Q.cur − 1;
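For a single node, the rollback amounts to moving the cursor backwards while the bound of the last processed tuple is below the new threshold. The Python sketch below illustrates this under simplifying assumptions: results are modeled as (score, set of tuple ids), the cascading Delete call is omitted, and the guard `cur > 0` is added here to keep the sketch safe on an empty prefix. The function name `roll_back_node` is hypothetical.

```python
from typing import FrozenSet, List, Tuple

Result = Tuple[float, FrozenSet[str]]  # (score, tuples contained in the JTT)


def roll_back_node(tuples: List[Tuple[str, float]], cur: int,
                   results: List[Result], theta: float) -> Tuple[int, List[Result]]:
    """Undo the processing of tuples whose bound is below theta.

    tuples: (tuple id, score^u bound) pairs in non-increasing bound order.
    cur:    index of the next un-processed tuple.
    """
    while cur > 0 and tuples[cur - 1][1] < theta:
        tid = tuples[cur - 1][0]
        # Drop the results that contain the tuple being rolled back.
        results = [r for r in results if tid not in r[1]]
        cur -= 1
    return cur, results


tuples = [("a1", 4.23), ("a3", 3.64)]
results = [(4.00, frozenset({"a1"})), (3.40, frozenset({"a3"}))]
print(roll_back_node(tuples, 2, results, theta=4.00))
```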



4.5 Caching Joined Tuples 

In Algorithm 1 and Algorithm 2, the procedures Insert and Delete may be called multiple times on multiple nodes for the same tuple. The core of the two procedures is the select operations (or semi-joins [15]). For example, in line 12 and line 18 of procedure Insert, we need to select the tuples that can join t from the output buffer of each child node of V_i and from the set of processed tuples of each father node of V_i, respectively. Although such select operations can be done efficiently by the DBMS using indexes, the cost of handling t is high due to the large number of database accesses. For example, in our experiments, the maximal number of database accesses for a new tuple t can be up to several hundred.

The select operations done for the same tuple t can be evaluated efficiently by sharing the computational cost among them. Assume a new tuple w_0 is inserted into the lattice shown in Fig. 5; then procedure Insert is called three times (once for each of the three nodes of W) and at most eight selections are done. All the eight select operations can be expressed using the following two relational algebra expressions: π_aid(σ_{wid=w_0}(W) ⋈ σ_{aid∈A_i}(A)) and π_pid(σ_{wid=w_0}(W) ⋈ σ_{pid∈P_j}(P)), where A_i and P_j represent the set of tuples in the output buffer of a node or the set of processed tuples of a node. Since A_i and P_j can be different from each other, the eight select operations need to be evaluated individually. However, if we rewrite the above expressions as π_aid(σ_{aid∈A_i}(σ_{wid=w_0}(W) ⋈ A)) and π_pid(σ_{pid∈P_j}(σ_{wid=w_0}(W) ⋈ P)), the eight select operations have two common sub-operations: σ_{wid=w_0}(W) ⋈ A and σ_{wid=w_0}(W) ⋈ P. If the results of the two common sub-operations are shared and the selections σ_{aid∈A_i} and σ_{pid∈P_j} are done in main memory, the eight select operations can be evaluated with only two database accesses.
Algorithm 4: CanJoinOneOutputTuple(lattice node V_i, tuple t)

1  Let R_j be the relation corresponding to the tuple set of V_i;
2  if the tuples of relation R_j that can join t have not been stored then
3      Query the tuples of relation R_j that can join t and store them;
4  foreach tuple t' of the stored tuples of relation R_j that can join t do
5      if t' can be found in V_i.output then return true;
6  return false;



Algorithm 4 shows our procedure to check whether tuple t can join at least one tuple in the output buffer of a lattice node V_i, which is called in line 12 of procedure Insert. In line 3, all the tuples in relation R_j that can join t are queried and cached in main memory. This set of cached joined tuples can be reused every time it is queried. The procedures for the select operations in line 18 of Insert and line 28 of Delete are designed in the same pattern and are omitted due to space limitations. Note that when the two procedures Insert and Delete are called recursively, the select operations done in the above lines are also evaluated by these procedures. Therefore, for each tuple t, a tree of tuples, which is rooted at t and consists of all the tuples that can join t, is created. The tree of tuples can be seen as the cached localization information of t. It is created on the fly, i.e., along with the execution of procedures Insert and Delete, and its depth is determined by the recursion depth of the two procedures. The maximum recursion depth of procedures Insert and Delete is bounded in terms of CN_max [15], where CN_max indicates the maximum size of the generated CNs. Hence, the height of this tree of tuples has the same bound.
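The caching pattern of Algorithm 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `query` callback is a hypothetical stand-in for one database access, the class and method names are invented for the example, and the cache is a flat dictionary rather than the tree of tuples described above.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple


class JoinCache:
    """Caches, per (relation, tuple), the tuples of that relation joining the tuple."""

    def __init__(self, query: Callable[[str, str], Iterable[str]]) -> None:
        self._query = query  # hypothetical stand-in for one database access
        self._store: Dict[Tuple[str, str], FrozenSet[str]] = {}
        self.accesses = 0    # number of simulated database accesses

    def joined(self, relation: str, t: str) -> FrozenSet[str]:
        key = (relation, t)
        if key not in self._store:  # lines 2-3 of Algorithm 4
            self.accesses += 1
            self._store[key] = frozenset(self._query(relation, t))
        return self._store[key]

    def expire(self, t: str) -> None:
        """Drop everything cached for t, e.g. once t has been handled."""
        for key in [k for k in self._store if k[1] == t]:
            del self._store[key]


def can_join_one_output_tuple(cache: JoinCache, relation: str,
                              output: Set[str], t: str) -> bool:
    # Lines 4-6 of Algorithm 4: the membership tests run in main memory.
    return any(t2 in output for t2 in cache.joined(relation, t))


def fake_query(relation: str, t: str) -> Iterable[str]:
    # Invented data: p0 joins w1 and w2 in relation W.
    return ["w1", "w2"] if (relation, t) == ("W", "p0") else []


cache = JoinCache(fake_query)
print(can_join_one_output_tuple(cache, "W", {"w2", "w7"}, "p0"))
print(can_join_one_output_tuple(cache, "W", {"w9"}, "p0"))
print(cache.accesses)
```

Both checks against different output buffers are answered from a single simulated access, which is the saving the section describes.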

Suppose a new tuple p_0 of P^Q is inserted into the two nodes of P^Q in the lattice shown in Fig. 5. Fig. 7 illustrates the select operations done in the procedure Insert (denoted as arrows in the left part) and the cached joined tuples of p_0 (shown in the right part). For instance, the arrow from V_9 to V_7 selects the tuples in relation W that can join p_0. Three of the select operations are denoted by dashed arrows because they would not be done if the results of the two select operations from V_9 to V_7 and from V_9 to V_6 are empty. For the same reason, the stored tuples of relation A that can join p_0 are denoted using dashed rectangles.




Fig. 7. Selections done in Insert and the cached joined tuples for a tuple p_0 of P^Q



When computing the initial top-k results, the database is static; hence the cached joined tuples of each tuple do not change and can be reused until the database is updated. When maintaining the top-k results, although the database is continually updated, we can assume that the database does not change while t is being handled. However, the cached joined tuples of t expire after t is handled by Algorithm 2. As shown in the experimental results, caching the joined tuples can greatly improve the efficiency of computing the initial top-k results and maintaining the top-k results.

4.6 Candidate Network Clustering 



According to Eq. (3), the score^u values of tuples in different CNs differ greatly. For example, the score^u values of tuples in CN_5 and CN_7 are smaller than those of tuples in CN_1 due to the large CN size. In algorithm GP, no tuples or only a small portion of the tuples are joined in the CNs whose tuples have small score values. If the CNs in Example 2 are evaluated by algorithm GP, a tuple set in each of two of the CNs would have no processed tuples. However, in the lattice, a node V_i^Q can be shared by multiple CNs. Thus, when inserting a tuple t into V_i^Q, t is processed in all the CNs in V_i^Q.CN. As shown in Fig. 5, since one A^Q node is shared by CN_2, CN_3, CN_6 and CN_7 in the lattice, tuples a_1 and a_3 are processed in all these four CNs when they are processed at that node, which results in un-needed operations at nodes V_2 and V_5 and two un-needed results, a_3 and a_1 ← w_4 → p_4 ← w_6 → a_3. We call the operations at V_2 and V_5 and the two JTTs un-needed because they would not occur or be found if the CNs were evaluated separately. These un-needed operations can cause further un-needed operations when maintaining the top-k results. For example, we have to join a new un-matched tuple of relation P with four tuples in V_5.output.

The essence of the above problem is that CNs have different potentials in producing top-k results, so the same tuple set can have different numbers of processed tuples in different CNs if they are evaluated separately. In order to avoid finding the un-needed results, the optimal method is to share only the tuple sets that have the same number of processed tuples among CNs when they are evaluated separately. However, we cannot get these numbers without evaluating the CNs. As an alternative, we attempt to estimate this number for the tuple sets of each CN C according to the following heuristic rules:

- If Max(C) = (Σ_{1≤i≤m} C.R_i^Q.t_1.tscore^u) / size(C), which indicates the maximum score^u of the JTTs that C can produce, is high, the tuple sets of C have more processed tuples.

- If two CNs have the same Max(C) values, the tuple sets of the CN with the larger size have more processed tuples.

Therefore, we can cluster the CNs using their Max(C) · ln(size(C)) values, where ln(size(C)) is used to normalize the effect of CN sizes. Then, when constructing the lattice, only the subtrees of CNs in the same cluster can be collapsed. For example, the Max(C) · ln(size(C)) values of the seven CNs of Example 2 are: 5.15, 2.93, 5.39, 6.84, 5.32, 5.70 and 3.03; hence they can be clustered into two clusters: {CN_2, CN_7} and {CN_1, CN_3, CN_4, CN_5, CN_6}. Fig. 8 shows the lattice after finding the top-3 results if the CNs are clustered, where the two un-needed JTTs in Fig. 5 are avoided. As shown in the experimental section, clustering the CNs can greatly improve the efficiency of computing the initial top-k results and handling the database updates.
We cluster the CNs using the K-means clustering algorithm [21], which needs an input parameter to indicate the number of expected clusters. We use Kmean to indicate the ratio of this input parameter to the number of CNs. The value of Kmean represents the trade-off between sharing the computation cost among CNs and considering their different potentials in producing top-k results. When Kmean = 0, the CNs are not clustered, and the CNs share the computation cost to the maximum extent. When Kmean = 1, all the CNs are evaluated separately. In our experiments, we find that Kmean = 0.6 is optimal both for computing the initial top-k results and for handling the database updates.
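To make the clustering step concrete, the sketch below runs a plain one-dimensional K-means over the seven values quoted for Example 2. It is only an illustration under stated assumptions: the deterministic initialization, the rounding rule in `num_clusters` and both function names are choices made for this example, whereas the experiments use randomized starting conditions.

```python
from typing import List


def num_clusters(kmean: float, n_cns: int) -> int:
    """Map the ratio Kmean to a cluster count (the rounding rule is an assumption)."""
    return max(1, min(n_cns, round(kmean * n_cns)))


def kmeans_1d(values: List[float], k: int, max_iter: int = 100) -> List[int]:
    """Lloyd's algorithm on scalars; returns one cluster label per value."""
    srt = sorted(values)
    # Deterministic start: k values spread evenly over the sorted list.
    centers = [srt[(i * (len(srt) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return labels


# Values of CN1 .. CN7 as listed in the text.
values = [5.15, 2.93, 5.39, 6.84, 5.32, 5.70, 3.03]
labels = kmeans_1d(values, 2)
clusters = {}
for i, lab in enumerate(labels, start=1):
    clusters.setdefault(lab, []).append(f"CN{i}")
print(clusters)
print(num_clusters(0.6, len(values)))
```

With two clusters the sketch groups CN2 with CN7 and the remaining five CNs together, matching the clusters named in the text.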
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Fig. 8. After finding the top-3 results if the CNs are clustered into two clusters 



5 Experimental Study 

We conducted extensive experiments to test the efficiency of our methods. We use the DBLP dataset (the download URL is given in the footnote). Note that DBLP is not continuously growing and is updated on a monthly basis. We use DBLP to simulate a continuously growing relational dataset because no real growing relational dataset is publicly available, and many studies [3, 4, 9] on top-k keyword queries over relational databases use DBLP. The downloaded XML file is decomposed into relations according to the schema shown in Fig. 9. The two arrows from PaperCite to Papers denote the foreign-key references from paperID to paperID and from citedPaperID to paperID, respectively. The DBMS used is MySQL (v5.1.44) with the default "Dedicated MySQL Server Machine" configuration. All the relations use the MyISAM storage engine. Indexes are built for all primary key and foreign key attributes, and full-text indexes are built for all text attributes. All the algorithms are implemented in C++. We conducted all the experiments on a PC with a 2.53 GHz CPU and 4 GB of memory running Windows 7.



5.1 Parameters 

We use the following five parameters in the experiments: (1) k: the top-k value; (2) l: the number of keywords in a query; (3) IDF: the ratio of the number of matched tuples to the number of total tuples, i.e., df_w/N; (4) CN_max: the maximum size of the generated CNs; and (5) Kmean: the ratio of the number of clusters of CNs to the number of CNs.



^ http://dblp.mpi-inf.mpg.de/dblp-mirror/index.php/ 
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Fig. 9. The DBLP schema (PK stands for primary key, FK for foreign key) 

The parameters with their default values (bold) are shown in Table 3. The selected keywords are listed in Table 4 with their IDF values, where the keywords in bold are keywords that are popular in author names. Ten queries are constructed for every IDF value, each of which contains three selected keywords. For each l value, ten queries are constructed by selecting l keywords from the row of IDF = 0.013 in Table 4. To avoid generating a small number of CNs for each query, one author-name keyword of each IDF value is always selected for each query.

When k grows, the cost of computing the initial top-k results increases since we need to compute more results, and the cost of maintaining the top-k results also increases since there are more tuples in the output buffers of the lattice nodes. The parameter CN_max has a great impact on keyword query processing because the number of generated CNs increases exponentially as CN_max increases. The number of matched tuples increases as IDF and l increase. Hence, the first four parameters k, l, IDF and CN_max affect the scalability of our method.



Table 3. Parameters

Name      Values
k         50, 100, 150, 200
l         2, 3, 4, 5
IDF       0.003, 0.007, 0.013, 0.03
CN_max    4, 5, 6, 7
Kmean     0, 0.20, 0.40, 0.60, 0.80, 1

Table 4. Keywords and their IDF values

IDF       Keywords
0.004     ATM, embedded, navigation, privacy, scalable, spatial, XML, Charles, Eric
0.007     clustering, fuzzy, genetic, machine, optimal, retrieval, sensor, semantic, video, James, Zhang
0.013     adaptive, architecture, database, evaluation, mobile, oriented, security, simulation, wireless, John, Wang
0.03      algorithm, design, information, learning, network, software, time, David, Michael



5.2 Exp-1: Initial Top-k Results Computation

In this experiment, we study the effects of the five parameters on computing the initial top-k results. We retrieve the data in the XML file sequentially until the numbers of tuples in the relations reach the numbers shown in Table 5. Then we run the algorithm EvalStatic-Pipelined on different values of each parameter while keeping the other four parameters at their default values. We use two measures to evaluate the effects of the parameters. The first is #R, the number of found results in the queue topk. The second measure is T, the time cost of running the algorithm. Ten top-k queries are selected



22 Yanwei Xu 



for each combination of parameters, and the average values of their metrics are reported in the following. In this experiment, Δdf_w (= 1%), Δavdl (= 1%) and Δk (= 1) all have very small values because they will be enlarged adaptively when maintaining the top-k results.

The main results of this experiment are given in Fig. 10. Note that the units for the y-axis are different for the three measures. Fig. 10(a), (b) and (c) show that the two measures both increase as k, IDF and CN_max grow. However, they do not increase rapidly in Fig. 10(a), (b) and (c), which implies the good scalability of our method. On the contrary, the time cost of the method of [9] in finding the top-k results, which is shown in Fig. 10(c) and denoted by T[SPARK], increases rapidly as CN_max grows. Fig. 10(c) shows that, compared to the existing method, algorithm EvalStatic-Pipelined is very efficient in finding the top-k results. The reason is that evaluating the CNs using the lattice can achieve full reduction, because all the tuples in the output buffers of the root nodes can form JTTs, and can save computation cost by sharing the common sub-expressions [15]. Fig. 10(d) shows that the effect of l seems more complicated: both measures may decrease when l increases. As shown in Fig. 10(d), #R and T even both achieve their minimum values when l = 5. This is because the probability that the keywords co-appear in a tuple and that the matched tuples can join is high when the number of keywords is large. Therefore, there are more JTTs that have high relevance scores, which results in a larger θ and smaller values of the two measures.



Table 5. Tuple numbers of relations

Relation       Tuples
Papers         157,300
PaperCite      9,155
Write          400,706
Authors        190,615
Proceedings    2,886
ProcEditors    1,936
ProcEditor     1,411



Fig. 10(e) presents the changes of the two measures when Kmean varies. Since the results of K-means clustering may be affected by the starting condition [21], for each Kmean value we run Algorithm 1 five times with different starting conditions for each keyword query and report the average experimental results. Note that the algorithm EvalStatic in KDynamic corresponds to Kmean = 0, since there is no CN clustering in KDynamic. From Fig. 10(e), we can find that clustering the CNs can greatly improve the efficiency of computing the top-k results, and the time cost decreases as Kmean increases. However, when Kmean = 1, which indicates that all the CNs are evaluated separately, the time cost grows to a higher value than when Kmean is 0.6 or 0.8. Therefore, it is important to select a proper Kmean value. The minimum T in this experiment is achieved at Kmean = 0.6; hence the default value of Kmean is 0.6 in our experiments. As can be seen in the next section, Kmean = 0.6 also results in the minimum time cost of handling database modifications.

Fig. 10(f) compares the time cost of our method in finding the top-k results with that of KDynamic, while varying CN_max. The time cost of KDynamic is denoted by "!Cache" because it does not cache the joined tuples for each tuple. We can find that caching the joined tuples for each tuple greatly improves the efficiency of computing the top-k results. More importantly, the improvement increases as CN_max grows. This is because when CN_max grows, the number of times the procedure Insert is called on each tuple increases fast, since the number of lattice nodes increases exponentially; hence the cost saved by storing the joined tuples of each tuple grows as CN_max grows.

Fig. 10. Experimental results of calculating the initial top-k results



From the curves of #R in Fig. 10, we can find that the #R values are large in all the settings: about several thousand. Recall that topk contains three kinds of results. The number of the first kind of results is k + Δk, which is small compared to the #R values. Since Δdf_w (= 1%), Δavdl and Δk all have very small values, the number of potential top-(k + Δk) results in topk is very small (< 10). Therefore, the third kind of results, which have score^u ≤ L.θ, is in the majority and is the most numerous.



5.3 Exp-2: Top-k Result Maintenance

In this experiment, we study the efficiency of Algorithm 2 in maintaining top-k results. We use the same keyword queries as in Exp-1. After calculating the initial top-k results for them, we sequentially insert additional tuples into the database by retrieving data from the DBLP XML file. At the same time, we delete randomly selected tuples from the database. Algorithm 2 is used to maintain the top-k results for the queries while the database is being updated. The database update records are read from the database log file; hence the database updating rate has no direct impact on the efficiency of top-k result maintenance, because the database is updated by another process.

We first add 713,084 new tuples into the database and delete 250,000 tuples from the database. The new data is roughly 90 percent of the data used in Exp-1. The composition of the additional tuples is shown in Table 6. Fig. 11(a) and (b) show the change of the average execution time of Algorithm 2 in handling the above database updates when varying the five parameters, which presents the efficiency of Algorithm 2. Note that the units for the x-axis are different for the five parameters, whose minimum and maximum values are labeled in Fig. 11(a) and (b); their other values can be found in Table 3. We can find that the time cost of handling database updates for the default queries is smaller than 1.5 ms. Comparing Fig. 11(a) and (b) with the curves of measure T in Fig. 10 (especially the curves in Fig. 10(d) and Fig. 10(e)), we can find that the time cost of handling database updates and the time cost of computing the initial top-k results have the same changing trends. This is because there are more outputted tuples in the lattice when more time is needed to compute the initial top-k results; hence more time is required to do the selections in procedures Insert and Delete, and their recursion depths are larger. Fig. 11(c) compares the time cost of our method in handling database updates with that of KDynamic, while varying CN_max. The time cost of KDynamic is also denoted as "!Cache". We can find that caching the joined tuples for each tuple also improves the efficiency of handling database updates, and the larger CN_max is, the greater the improvement.

Table 6. Composition of the additional tuples

Relation       Tuples
Papers         156,965
PaperCite      20,010
Write          411,109
Authors        111,094
Proceedings    3,033
ProcEditors    3,886
ProcEditor     6,987
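The figures quoted for this experiment can be cross-checked with a few lines of arithmetic. The snippet below only re-adds the per-relation counts of Table 5 and Table 6; the variable names are chosen for this example.

```python
# Tuple counts per relation, copied from Table 5 (initial) and Table 6 (additional).
initial = {"Papers": 157_300, "PaperCite": 9_155, "Write": 400_706,
           "Authors": 190_615, "Proceedings": 2_886, "ProcEditors": 1_936,
           "ProcEditor": 1_411}
added = {"Papers": 156_965, "PaperCite": 20_010, "Write": 411_109,
         "Authors": 111_094, "Proceedings": 3_033, "ProcEditors": 3_886,
         "ProcEditor": 6_987}

total_initial = sum(initial.values())
total_added = sum(added.values())
ratio = total_added / total_initial

print(total_initial, total_added, round(ratio, 3))
```

The additional tuples sum to 713,084, the number stated in the text, and amount to about 93 percent of the initial data, consistent with the "roughly 90 percent" description.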



Since it is hard to read in one figure, we split the data of the five parameters into two figures. 



Scalable Continual Top-k Keyword Search in Relational Databases 




(a) Time for handling database updates while varying Kmean and CN_max

(b) Time for handling database updates while varying l, k and IDF

Fig. 11. Efficiency of top-k result maintenance
Secondly, we insert only the 713,084 additional tuples into the database while 
maintaining the top-100 results for the default ten keyword queries. We adopt two 
different growth rates of Δdf_w: Δdf_w+ = 2% and Δdf_w+ = 5%, which mean that when a 
ln(N/(df_w + 1)) value exceeds its upper bound, the corresponding Δdf_w value is 
increased by 2% and 5%, respectively. After every 100,000 additional tuples are 
inserted, we record the average frequencies of enlarging Δdf_w and of calling the 
procedure RollBack for the ten queries. Their changes are shown in Fig. 11(d) and (e), 
respectively, whose x-axes (in units of 10^5) indicate the number of additional tuples. 
Note that we do not report the frequency of enlarging avdl because it is very small in 
the experiment (< 2). 

The curves in Fig. 11(d) drop rapidly after the first 100,000 additional tuples are 
inserted. Although Δdf_w is enlarged more frequently when its growth rate is lower, 
after 300,000 additional tuples are inserted, the number of times Δdf_w is enlarged, 
i.e., the number of times the upper bound of ln(N/(df_w + 1)) is exceeded, falls below 
5 for both growth rates. After inserting 300,000 additional tuples, the maximum Δdf_w 
value of all the relations is 15; hence it is reasonable to set 15 as the maximum value 
for Δdf_w. There is only one curve in Fig. 11(e) because the growth rate of Δdf_w has no 
great impact on the number of calls to the procedure RollBack, which is mainly affected 
by the frequency of finding new results whose upper-bound scores exceed the threshold 
θ. Note that θ is increased after each call to RollBack. Therefore, the number of calls 
to RollBack decreases, since it becomes increasingly hard to find new results whose 
upper-bound scores exceed θ. In order to study the impact of reversing the pipelined 
evaluation on the efficiency of handling database updates, we also redo the experiment 
without calling the procedure RollBack. Then, the average time cost of handling 
database updates increases by 45.4%, which confirms the necessity of reversing the 
pipelined evaluation. 

Then, we delete 500,000 randomly selected tuples from the database after inserting 
the 713,084 additional tuples. Two different growth rates of Δk are adopted: Δk+ = 2 
and Δk+ = 5, which mean that when the number of results whose upper-bound scores exceed 
the threshold θ falls below k, the corresponding Δk value is increased by 2 and 5, 
respectively. We record the average number of times Δk is enlarged for the ten queries 
after every 100,000 deleted tuples; the changes are shown in Fig. 11(f). Fig. 11(f) 
shows that the frequency of shortages of top-k results falls to a very small number 
after 200,000 tuples are deleted, i.e., after Δk has been enlarged to about 20. As 
indicated by the curve of k in Fig. 11(b), a large Δk value can greatly decrease the 
efficiency of handling database updates. Therefore, it is reasonable to set the maximum 
value of Δk to 20. 
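
The Δk growth rule used in this experiment can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation evaluated above: the function names enlarge_delta_k and check_shortage, the plain list of upper-bound scores, and the cap parameter are all hypothetical. Only the rule itself (enlarge Δk by a fixed step when fewer than k results have upper-bound scores above the threshold) and the observation that Δk settles at about 20 come from the text.

```python
def enlarge_delta_k(delta_k, step, max_delta_k=20):
    """Return delta_k increased by `step`, capped at `max_delta_k`.

    The default cap of 20 mirrors the value at which delta_k settled
    in the deletion experiment; it is an assumption of this sketch.
    """
    return min(delta_k + step, max_delta_k)


def check_shortage(upper_bound_scores, threshold, k, delta_k, step):
    """Enlarge delta_k if fewer than k results exceed the threshold.

    `upper_bound_scores` is a hypothetical flat list holding the
    upper-bound score of every currently maintained result.
    Returns (new_delta_k, enlarged).
    """
    qualifying = sum(1 for s in upper_bound_scores if s > threshold)
    if qualifying < k:
        return enlarge_delta_k(delta_k, step), True
    return delta_k, False


# Three maintained results, but only two exceed the threshold 0.5.
scores = [0.9, 0.8, 0.4]
print(check_shortage(scores, 0.5, k=3, delta_k=0, step=2))   # (2, True)
print(check_shortage(scores, 0.5, k=2, delta_k=0, step=2))   # (0, False)
print(check_shortage(scores, 0.5, k=3, delta_k=18, step=5))  # (20, True)
```

A larger step (5 instead of 2) reaches the cap after fewer shortages, at the price of maintaining more candidate results, which is the trade-off indicated by the curve of k in Fig. 11(b).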



6 Conclusion 

In this paper, we have studied the problem of finding the top-k results in relational 
databases for a continual keyword query. We proposed an approach that finds the 
answers whose upper bounds of future relevance scores are larger than a threshold. We 
adopt an existing scheme for finding all the results in a relational data stream, but 
incorporate the ranking mechanisms into the query processing methods and make two 
improvements that facilitate efficient top-k keyword search in relational databases. 
The proposed method can efficiently maintain the top-k results of a keyword query 
without re-evaluation. Therefore, it can be used to answer continual keyword queries 
in databases that are updated frequently. 